Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[draft] Introduce the ADValue class for templated traceless forward mode #64

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

cgraeser
Copy link
Contributor

@cgraeser cgraeser commented Nov 1, 2024

This was initially developed within the Dune module dune-fufem (https://gitlab.dune-project.org/fufem/dune-fufem).

The class adolc::ADValue<T, maxOrder,dim> implementes a traceless forward mode templated with respect to the underlying type T, the maximal tracked derivative order maxOrder and the domain dimension dim. Currently only maxOrder<=2 is implemented.

Using the special value dynamicDim allows to use a runtime dimension. In the latter case the default is to use std::vector as storage. If support for boost::pool is activated, this is used instead.

Notable differences to adtl::adouble are:

  • Being able to use other types that T=double e.g. allows for low/high precision or interval arithmetics.
  • Support for second order derivatives. The interface is also flexible for later implementation of higher order.
  • Fixing the dimension at compile time allows the compiler to do better optimization and generate faster code. Furthermore this guarantees thread-safety (x) by construction and allows to use different dimensions at the same time within a single program.
  • In the dynamic dimension case, the dimension is set per variable, which allows to have different dimensions at the same time. This is implemented by using one boost:pools per dimension instead of a single one for a globally fixed dimension. This also avoids the memory leaks of adtl::adouble when changing the global dimension. Since the pools are thread_local instead of global, this again provides thread-safety (X).

The last point could probably also be implemented for adtl::adouble.

(x) "Thread-safety" hear does not mean that concurrent access to a single ADValue object is safe. Instead if means that concurrent accesss to different objects in different threads is safe and not prone to race-conditions due to internally used global variables.

@cgraeser
Copy link
Contributor Author

cgraeser commented Nov 1, 2024

This is marked as draft, because there are several things that need to be discussed, details that may need to be changed, and features that may need to be added. This includes the following topics:

  • Naming: While the class is placed in the namespace adolc, the naming of types uses CamelCase while adolc uses lower case elsewhere. (Done. The naming was considered fine as is for now.)
  • User interface: This needs to be discussed. This in particular involves the way ADValues are created and initialized and derivatives are read. Currently the whole stored jet is exposed publicly. I'd propose to only expose individual partial derivatives in order to be more flexible for changing the internal data structure later on, e.g. to use a Taylor-based implementation for higher order.
  • Size policy for dynamic mode: Currently the dynamic mode carefully traces and adjusts internal dimensions. This allows to do arithmetics with default constructed values (with internal size=0). While the performance loss seems to be negligible, one may alternatively declare mixing dimensions UB to simplify the code and maybe improve performance. (Done. We keep the size tracking to avoid UB when violating conventions.)
  • Nonlinear functions: Only a small collection is implemented so far. But there's a simple interface to add more. Furthermore no special care has been taken for 'reasonable' return values in kinks or singularities.
  • Tests: Missing so far. (Done. A rudimentary test was added.)
  • Benchmarks against existing AdolC modes may be useful.
  • Documentation only exists as Doxygen class docs but no manual style descriptive text or examples. (Done.The class documentation contains an example now.)
  • C++ standard: Currently this uses C++20 (for spaceship operator) but this could be easily downgraded to C++17 (maybe even C++14) at the expense of some more boiler-plate code. (Done. C++20 is fine.)
  • Build system integration: This is currently a single self-contained header not using any other parts of AdolC. Thus the USE_BOOST_POOL macro has to be set manually so far. (Done.)

@cgraeser
Copy link
Contributor Author

cgraeser commented Nov 7, 2024

Notice that the test failure is unrelated, since this pull request does not change anything about the existing code and the failure also shows up in current master.

@cgraeser
Copy link
Contributor Author

In general one would expect that it is beneficial to exploit symmetry for second order derivatives. To my surprise this is slower for static dimension. My first guess on why computing only one triangle of the Hessian is slower was, that this is due to the conditional index flip when requesting values. It turns out that this is not the case. Even if one only queries the computed triangle and removes the index flip this is still slower. To be precise, using

for(int i=0; i<dim; ++i)
  for(int j=0; j<=dim; ++j)
    ...

did always produce faster code than

for(int i=0; i<dim; ++i)
  for(int j=i; j<=dim; ++j)
    ...

Surprisingly (in the static dim case) the following is still faster than the latter but not fully on par with the former

for(int i=0; i<dim; ++i)
  for(int j=0; j<=dim; ++j)
    if (j>=i)
      ...

I guess that the code for computing the full matrix benefits from auto-vectorization using SIMD operations.
In my tests (local Laplace assembler as Hessian of a local Dirichlet energy) the difference was almost a factor two (with static dim), i.e. this aspect is significant.

For dynamic dimension the situation is different: Here exploiting symmetry was measurably faster.

@cgraeser
Copy link
Contributor Author

Another performance observation:

One can always compute higher order derivatives using nested ADValues. Simply using a nested 1st order dim-variate ADValue to compute the Hessian is slow. But interestingly you can even compute mixed 2nd order derivatives using nested univariate ADValues if you initialize the derivatives appropriately. Then you do one f-evaluation per derivative:

// Raw input vector
using Vector = std::array<double,dim>;

// Nested 1st order univariate AD-value to compute mixed second order derivatives
using Nested = ADValue<ADValue<double,1,1>, 1, 1>;

// AD-aware vector
using ADVector = std::array<NestedAD, dim>;

Vector x = ...
ADVector x_ad;
for(int i=0; i<dim; ++i)
  for(int j=i; j<=dim; ++j)
  {
    // Initialize values of AD-aware vector
    for(int k=0; k<=dim; ++k)
       x_ad[k] = x[k];
    // Track (i,j)-th derivative
    x_ad[i].partial(0).partial() = 1;
    x_ad[j].partial().partial(0) = 1;
    auto y = f_raw(x_ad);
    hessian[i][j] = y.partial(0).partial(0);
    hessian[j][i] = hessian[i][j];
  }

Surprisingly this was in my tests on par with the dynamic size 2nd order ADValue<double,2,dynamicDim> if you the latter also exploits symmetry. For me this put's a little bit in question if one really needs the (internally significantly more complicated) dynamic dimension version.

@TimSiebert1
Copy link
Collaborator

I guess that the code for computing the full matrix benefits from auto-vectorization using SIMD operations.
In my tests (local Laplace assembler as Hessian of a local Dirichlet energy) the difference was almost a factor two (with static dim), i.e. this aspect is significant.

Hmmm interesting. Did your tests include higher dimensions and more complex functions? I'm not sure if I understand what your actual test function looks like. I would expect this to not hold in general. Otherwise, it would be something to keep in mind for the optimization of ADOL-C. I would further expect that this behavior changes for higher derivative (>2) tensors as well.

@TimSiebert1
Copy link
Collaborator

Since we currently dont have a higher-order tape-less approach, its very cool that your approach can compute higher-order derivatives. At some point this could be replaced by an interface without having to nest all types, as with the tape-based method of ADOL-C.

@cgraeser
Copy link
Contributor Author

cgraeser commented Nov 11, 2024

I only tested 2nd order (more is not supported by ADValue currently). But since there's more duplicate derivatives for higher order you're probably right.

I basically used two test cases:

  1. Compute Hessian of the local Dirichlet energy $f(v) = 1/2\int_e |\sum_{i=1}^8 v_i \lambda_i|^2$ where $v$ are the coefficients wrt the first order finite element shape functions $(\lambda_i)$ on a hexahedron $e$.
  2. Compute gradient and Hessian of a nested nonlinear binary function designed to incorporate many "features" that ADValue should support (various combinations of arithmetic operations with constants/non-constant values, default construction, conditionals, interaction with custom matrix types, ...). This currently takes the following form, where Dune::FieldMatrix<T,n,m> is a simple statically sized nxm matrix with scalar entries of type T:
    auto f_raw = [](auto x) {
      using std::sin;
      using std::cos;
      using std::exp;
      using std::log;
      using std::pow;
      using std::fabs;
      using F = std::decay_t<decltype(x[0])>;
      F c1;
      c1 = 13;
      F c2(42);
      F c3;
      c3 = 1;
      Dune::FieldMatrix<F, 3, 3> A;
      for(auto i : Dune::range(3))
        for(auto j : Dune::range(3))
          A[i][j] = sin(x[0]*i + cos(x[1])) + exp(cos(j*x[0]*x[1]))/(1+x[0]*x[0]+x[1]*x[1]) + 2/(1+x[0]*x[0]+x[1]*x[1]);
      F z = 1;
      z *= pow(A.determinant(), 3) + pow(2, sin(x[0]*x[1])) + c1 + c2 + pow(sin(x[0])+2., cos(x[1])+2);
      z *= c3;
      if ((x[0]>0) and (x[0] <= x[1]))
        return F(z);
      else
        return F(0);
    };

@cgraeser
Copy link
Contributor Author

Since we currently dont have a higher-order tape-less approach, its very cool that your approach can compute higher-order derivatives. At some point this could be replaced by an interface without having to nest all types, as with the tape-based method of ADOL-C.

@cgraeser cgraeser closed this Nov 11, 2024
@cgraeser cgraeser reopened this Nov 11, 2024
@cgraeser cgraeser force-pushed the feature/templated-traceless-forward branch from de0872c to 418782f Compare November 15, 2024 10:46
@cgraeser cgraeser marked this pull request as draft November 15, 2024 23:13
@cgraeser cgraeser force-pushed the feature/templated-traceless-forward branch from ffece1a to bc860ee Compare January 1, 2025 21:48
Copy link

codecov bot commented Jan 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.57%. Comparing base (5a575a1) to head (bfa0545).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #64   +/-   ##
=======================================
  Coverage   65.57%   65.57%           
=======================================
  Files          51       52    +1     
  Lines       26162    26165    +3     
  Branches     1843     1843           
=======================================
+ Hits        17156    17159    +3     
  Misses       9006     9006           
Flag Coverage Δ
unittests 65.57% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cgraeser
Copy link
Contributor Author

cgraeser commented Jan 1, 2025

I just (force-) pushed an updated version. This improves many aspects:

  • Fix a few bugs when doing arithmetic operations between ADValues with different dynamic dimension.
  • One can now also omit the dimension when creating an ADvalues with dynamicDim. It is automatically set to an appropriate value.
  • Significantly simplify the interface of the two (static and dynamic dim) storage backends. Now they are essentially flat arrays that export some information which derivatives are stored.
  • We now only store and compute the lower triangle of the Hessian in the dynamicDim case, while we still use the full matrix in case of a compile-time dimension. These turned out to be the approaches with the best performance in my tests.
  • Doxygen documentation was improved and now contains a simple example.
  • This now uses the USE_BOOST_POOL macro as provided by the build-system. However, in order to activate boost-pool support (e.g. for the test) with CMake, [cmake] Add CMake option to enable boost pool support #93 is needed.

… mode

This was initially developed within the Dune module
dune-fufem (https://gitlab.dune-project.org/fufem/dune-fufem).

The class `adolc::ADValue<T,maxOrder,dim>` implementes a traceless
forward mode templated with respect to the underlying type `T`,
the maximal tracked derivative order `maxOrder` and the domain
dimension `dim`. Currently only `maxOrder<=2` is implemented.

Using the special value `dynamicDim` allows to use a runtime dimension.
In the latter case the default is to use `std::vector` as storage.
If support for `boost::pool` is activated, this is used instead.

Notable differences to `adtl::adouble` are:

* Being able to use other types that `T=double` e.g. allows for
  low/high precision or interval arithmetics.
* Support for second order derivatives. The interface is also
  flexible for later implementation of higher order.
* Fixing the dimension at compile time allows the compiler to
  do better optimization and generate faster code. Furthermore
  this guarantees thread-safety (x) by construction and allows
  to use different dimensions at the same time within a single
  program.
* In the dynamic dimension case, the dimension is set per variable,
  which allows to have different dimensions at the same time. This
  is implemented by using one `boost:pool` per dimension instead
  of a single one for a globally fixed dimension. This also avoids
  the memory leaks of `adtl::adouble` when changing the global
  dimension. Since the pools are `thread_local` instead of global,
  this again provides thread-safety (X).

The last point could probably also be implemented for `adtl::adouble`.

(x) "Thread-safety" hear does _not_ mean that concurrent access
to a single `ADValue` object is safe. Instead it means that concurrent
accesss to different objects in different threads is safe and not
prone to race-conditions due to internally used global variables.

While higher order derivatives could also be computed using
`adtl_hov::adouble`, this requires several calls to compute mixed
derivatives which is significantly slower compared to computing
all at once. Furthermore the latter cannot be used in concurrent
threads and with different dimensions at once.
The test is still missing checks for many features.
@cgraeser cgraeser force-pushed the feature/templated-traceless-forward branch from bc860ee to bfa0545 Compare January 6, 2025 09:00
@cgraeser
Copy link
Contributor Author

cgraeser commented Jan 6, 2025

Since #93 was merged, I rebased, such that that the test for ADValue can be build with boost-pool support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants